{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ssw34RVKHmsJ"
   },
   "source": [
    "# Advanced Document Parsing For Enterprises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "b2oYelmWW35Q"
   },
   "source": [
    "## Introduction\n",
    "\n",
    "The bread and butter of natural language processing technology is text. Once we can reduce a set of data into text, we can do all kinds of things with it: question answering, summarization, classification, sentiment analysis, searching and indexing, and more.\n",
    "<br>\n",
    "<br>\n",
    "In the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models.\n",
    "<br>\n",
    "<br>\n",
    "In this notebook, we will use a real-world pharmaceutical drug label to test out various performant approaches to parsing PDFs. This will allow us to use [Cohere's Command-R model](https://txt.cohere.com/command-r/) in a RAG setting to answer questions and asks about this label, such as \"I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of\" a given pharmaceutical.\n",
    "<br>\n",
    "<br>\n",
    "![image.png]()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f-mq1FCojI2p"
   },
   "source": [
    "<a name=\"top\"></a>\n",
    "## PDF Parsing\n",
    "\n",
    "We will go over five proprietary as well as open source options for processing PDFs. The parsing mechanisms demonstrated in the following sections are\n",
    "- [Google Document AI](#gcp)\n",
    "- [AWS Textract](#aws)\n",
    "- [Unstructured.io](#unstructured)\n",
    "- [LlamaParse](#llama)\n",
    "- [pdf2image + pytesseract](#pdf2image)\n",
    "\n",
    "By way of example, we will be parsing a [21-page PDF](https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf) containing the label for a recent FDA drug approval, the beginning of which is shown below. Then, we will perform a series of basic RAG tasks with our different parsings and evaluate their performance.\n",
    "\n",
    "![image.png]()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lB0o4L4jh5Sv",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Getting Set Up\n",
    "\n",
    "Before we dive into the technical weeds, we need to set up the notebook's runtime and filesystem environments. The code cells below do the following:\n",
    "- Install required libraries\n",
    "- Confirm that data dependencies from the GitHub repo have been downloaded. These will be under `data/document-parsing` and contain the following:\n",
    "    - the PDF document that we will be working with, `fda-approved-drug.pdf` (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)\n",
    "    - precomputed parsed documents for each parsing solution. While the point of this notebook is to illustrate how this is done, we provide the parsed final results to allow readers to skip ahead to the RAG section without having to set up the required infrastructure for each solution.)\n",
    "- Add utility functions needed for later sections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZP1v0aF6Ij3U"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "! sudo apt install tesseract-ocr poppler-utils\n",
    "! pip install \"cohere<5\" fsspec hnswlib google-cloud-documentai google-cloud-storage boto3 langchain-text-splitters llama_parse pytesseract pdf2image pandas\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "A5yWsKDSL_kY"
   },
   "outputs": [],
   "source": [
    "data_dir = \"data/document-parsing\"\n",
    "source_filename = \"example-drug-label\"\n",
    "extension = \"pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "sources = [\"gcp\", \"aws\", \"unstructured-io\", \"llamaparse-text\", \"llamaparse-markdown\", \"pytesseract\"]\n",
    "\n",
    "filenames = [\"{}-parsed-fda-approved-drug.txt\".format(source) for source in sources]\n",
    "filenames.append(\"fda-approved-drug.pdf\")\n",
    "\n",
    "for filename in filenames:   \n",
    "    file_path = Path(f\"{data_dir}/{filename}\")\n",
    "    if file_path.is_file() == False:\n",
    "        print(f\"File {filename} not found at {data_dir}!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BoM6-Tq-Sm-1"
   },
   "source": [
    "### Utility Functions\n",
    "Make sure to include the notebook's utility functions in the runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jl7BJsvdSrr5"
   },
   "outputs": [],
   "source": [
    "def store_document(path: str, doc_content: str):\n",
    "    with open(path, 'w') as f:\n",
    "      f.write(doc_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "R2u-drbt7SOQ"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "def insert_citations_in_order(text, citations, documents):\n",
    "    \"\"\"\n",
    "    A helper function to pretty print citations.\n",
    "    \"\"\"\n",
    "\n",
    "    citations_reference = {}\n",
    "    for index, doc in enumerate(documents):\n",
    "        citations_reference[index] = doc\n",
    "\n",
    "    offset = 0\n",
    "    # Process citations in the order they were provided\n",
    "    for citation in citations:\n",
    "        # Adjust start/end with offset\n",
    "        start, end = citation['start'] + offset, citation['end'] + offset\n",
    "        citation_numbers = []\n",
    "        for doc_id in citation[\"document_ids\"]:\n",
    "            for citation_index, doc in citations_reference.items():\n",
    "                if doc[\"id\"] == doc_id:\n",
    "                    citation_numbers.append(citation_index)\n",
    "        references = \"(\" + \", \".join(\"[{}]\".format(num) for num in citation_numbers) + \")\"\n",
    "        modification = f'{text[start:end]} {references}'\n",
    "        # Replace the cited text with its bolded version + placeholder\n",
    "        text = text[:start] + modification + text[end:]\n",
    "        # Update the offset for subsequent replacements\n",
    "        offset += len(modification) - (end - start)\n",
    "\n",
    "    # Add the citations at the bottom of the text\n",
    "    text_with_citations = f'{text}'\n",
    "    citations_reference = [\"[{}]: {}\".format(x[\"id\"], x[\"text\"]) for x in citations_reference.values()]\n",
    "\n",
    "    return text_with_citations, \"\\n\".join(citations_reference)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SULVG_HwKdm5"
   },
   "outputs": [],
   "source": [
    "def format_docs_for_chat(documents):\n",
    "  return [{\"id\": str(index), \"text\": x} for index, x in enumerate(documents)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y-VohvR3X6S6",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Document Parsing Solutions\n",
    "\n",
    "For demonstration purposes, we have collected and saved the parsed documents from each solution in this notebook. Skip to the [next section](#document-questions) to run RAG with Command-R on the pre-fetched versions. You can find all parsed resources in detail at the link [here](https://github.com/gchatz22/temp-cohere-resources/tree/main/data)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jLtrGjXiJE9M"
   },
   "source": [
    "<a name=\"gcp\"></a>\n",
    "### Solution 1: Google Cloud Document AI [[Back to Solutions]](#top)\n",
    "\n",
    "Document AI helps developers create high-accuracy processors to extract, classify, and split documents.\n",
    "<br>\n",
    "External documentation: https://cloud.google.com/document-ai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cxwJ_jZpgNDo"
   },
   "source": [
    "#### Parsing the document\n",
    "\n",
    "The following block can be executed in one of two ways\n",
    "1. Inside a Google Vertex AI environment\n",
    "  - No authentication is needed\n",
    "2. Inside the notebook\n",
    "  - Authentication needed\n",
    "  - There are pointers inside the code on which lines to uncomment in order to achieve that\n",
    "<br>\n",
    "**Note: You can skip to the next block if you want to use the pre-existing parsed version.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "uZdjNlfxJEXv"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Extracted from https://cloud.google.com/document-ai/docs/samples/documentai-batch-process-document\n",
    "\"\"\"\n",
    "\n",
    "import re\n",
    "from typing import Optional\n",
    "\n",
    "from google.api_core.client_options import ClientOptions\n",
    "from google.api_core.exceptions import InternalServerError\n",
    "from google.api_core.exceptions import RetryError\n",
    "from google.cloud import documentai  # type: ignore\n",
    "from google.cloud import storage\n",
    "\n",
    "project_id = \"\"\n",
    "location = \"\"\n",
    "processor_id = \"\"\n",
    "gcs_output_uri = \"\"\n",
    "# credentials_file = \"populate if you are running in a non Vertex AI environment.\"\n",
    "gcs_input_prefix = \"\"\n",
    "\n",
    "\n",
    "def batch_process_documents(\n",
    "    project_id: str,\n",
    "    location: str,\n",
    "    processor_id: str,\n",
    "    gcs_output_uri: str,\n",
    "    gcs_input_prefix: str,\n",
    "    timeout: int = 400\n",
    ") -> None:\n",
    "    parsed_documents = []\n",
    "\n",
    "    # Client configs\n",
    "    opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n",
    "    # With credentials\n",
    "    # opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\", credentials_file=credentials_file)\n",
    "\n",
    "    client = documentai.DocumentProcessorServiceClient(client_options=opts)\n",
    "    processor_name = client.processor_path(project_id, location, processor_id)\n",
    "\n",
    "    # Input storage configs\n",
    "    gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_prefix)\n",
    "    input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)\n",
    "\n",
    "    # Output storage configs\n",
    "    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(gcs_uri=gcs_output_uri, field_mask=None)\n",
    "    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)\n",
    "    storage_client = storage.Client()\n",
    "    # With credentials\n",
    "    # storage_client = storage.Client.from_service_account_json(json_credentials_path=credentials_file)\n",
    "\n",
    "    # Batch process docs request\n",
    "    request = documentai.BatchProcessRequest(\n",
    "        name=processor_name,\n",
    "        input_documents=input_config,\n",
    "        document_output_config=output_config,\n",
    "    )\n",
    "\n",
    "    # batch_process_documents returns a long running operation\n",
    "    operation = client.batch_process_documents(request)\n",
    "\n",
    "    # Continually polls the operation until it is complete.\n",
    "    # This could take some time for larger files\n",
    "    try:\n",
    "        print(f\"Waiting for operation {operation.operation.name} to complete...\")\n",
    "        operation.result(timeout=timeout)\n",
    "    except (RetryError, InternalServerError) as e:\n",
    "        print(e.message)\n",
    "\n",
    "    # Get output document information from completed operation metadata\n",
    "    metadata = documentai.BatchProcessMetadata(operation.metadata)\n",
    "    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:\n",
    "        raise ValueError(f\"Batch Process Failed: {metadata.state_message}\")\n",
    "\n",
    "    print(\"Output files:\")\n",
    "    # One process per Input Document\n",
    "    for process in list(metadata.individual_process_statuses):\n",
    "        matches = re.match(r\"gs://(.*?)/(.*)\", process.output_gcs_destination)\n",
    "        if not matches:\n",
    "            print(\"Could not parse output GCS destination:\", process.output_gcs_destination)\n",
    "            continue\n",
    "\n",
    "        output_bucket, output_prefix = matches.groups()\n",
    "        output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)\n",
    "\n",
    "        # Document AI may output multiple JSON files per source file\n",
    "        # (Large documents get split in multiple file \"versions\" doc --> parsed_doc_0 + parsed_doc_1 ...)\n",
    "        for blob in output_blobs:\n",
    "            # Document AI should only output JSON files to GCS\n",
    "            if blob.content_type != \"application/json\":\n",
    "                print(f\"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}\")\n",
    "                continue\n",
    "\n",
    "            # Download JSON file as bytes object and convert to Document Object\n",
    "            print(f\"Fetching {blob.name}\")\n",
    "            document = documentai.Document.from_json(blob.download_as_bytes(), ignore_unknown_fields=True)\n",
    "            # Store the filename and the parsed versioned document content as a tuple\n",
    "            parsed_documents.append((blob.name.split(\"/\")[-1].split(\".\")[0], document.text))\n",
    "\n",
    "    print(\"Finished document parsing process.\")\n",
    "    return parsed_documents\n",
    "\n",
    "# Call service\n",
    "# versioned_parsed_documents = batch_process_documents(\n",
    "#     project_id=project_id,\n",
    "#     location=location,\n",
    "#     processor_id=processor_id,\n",
    "#     gcs_output_uri=gcs_output_uri,\n",
    "#     gcs_input_prefix=gcs_input_prefix\n",
    "# )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "id": "CT4QwMTwLPad",
    "outputId": "1c85c3aa-a3b5-48c4-bb6e-5033c2c604a0"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "type": "string"
      },
      "text/plain": [
       "'\\nPost process parsed document and store it locally.\\nMake sure to run in Google Vertex AI environment or include a credentials file.\\n'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"\n",
    "Post process parsed document and store it locally.\n",
    "Make sure to run in Google Vertex AI environment or include a credentials file.\n",
    "\"\"\"\n",
    "\n",
    "# from pathlib import Path\n",
    "# from collections import defaultdict\n",
    "\n",
    "# parsed_documents = []\n",
    "# combined_versioned_parsed_documents = defaultdict(list)\n",
    "\n",
    "# # Assemble versioned documents together ({\"doc_name\": [(0, doc_content_0), (1, doc_content_1), ...]}).\n",
    "# for filename, doc_content in versioned_parsed_documents:\n",
    "#   filename, version = \"-\".join(filename.split(\"-\")[:-1]), filename.split(\"-\")[-1]\n",
    "#   combined_versioned_parsed_documents[filename].append((version, doc_content))\n",
    "\n",
    "# # Sort documents by version and join the content together.\n",
    "# for filename, docs in combined_versioned_parsed_documents.items():\n",
    "#   doc_content = \" \".join([x[1] for x in sorted(docs, key=lambda x: x[0])])\n",
    "#   parsed_documents.append((filename, doc_content))\n",
    "\n",
    "# # Store parsed documents in local storage.\n",
    "# for filename, doc_content in parsed_documents:\n",
    "#  file_path = \"{}/{}-parsed-{}.txt\".format(data_dir, \"gcp\", source_filename)\n",
    "#  store_document(file_path, doc_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dmkya9saFN8R"
   },
   "source": [
    "#### Visualize the parsed document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "S6SMCDb1FhqZ"
   },
   "outputs": [],
   "source": [
    "filename = \"gcp-parsed-{}.txt\".format(source_filename)\n",
    "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n",
    "    parsed_document = doc.read()\n",
    "\n",
    "print(parsed_document[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dlRxuddi5w0E"
   },
   "source": [
    "<a name=\"aws\"></a>\n",
    "### Solution 2: AWS Textract [[Back to Solutions]](#top)\n",
    "\n",
    "[Amazon Textract](https://aws.amazon.com/textract/) is an OCR service offered by AWS. It can detect text, forms, tables, and more in PDFs and images. In this section, we go over how to use Textract's asynchronous API.\n",
    "<br>\n",
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1YCszLJnge4j"
   },
   "source": [
    "#### Parsing the document"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vFBasw782Gho"
   },
   "source": [
    "We assume you are working within the AWS ecosystem (from a SageMaker notebook, EC2 instance, a Lambda function, ...) with valid credentials. Much of the code here is from supplemental materials created by AWS and offered here:\n",
    "\n",
    "- https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract\n",
    "- https://github.com/aws-samples/textract-paragraph-identification/tree/main\n",
    "\n",
    "At minimum, you will need access to the following AWS resources to get started:\n",
    "\n",
    "- Textract\n",
    "- an S3 bucket containing the document(s) to process - in this case, our `example-drug-label.pdf` file\n",
    "- an SNS topic that Textract can publish to. This is used to send a notification that parsing is complete.\n",
    "- an IAM role that Textract will assume, granting access to the S3 bucket and SNS topic\n",
    "\n",
    "First, we bring in the `TextractWrapper` class provided in the [AWS Code Examples repository](https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/textract/textract_wrapper.py). This class makes it simpler to interface with the Textract service."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "casiElly3G2C"
   },
   "outputs": [],
   "source": [
    "# source: https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract\n",
    "\n",
    "# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n",
    "# SPDX-License-Identifier: Apache-2.0\n",
    "\n",
    "\"\"\"\n",
    "Purpose\n",
    "\n",
    "Shows how to use the AWS SDK for Python (Boto3) with Amazon Textract to\n",
    "detect text, form, and table elements in document images.\n",
    "\"\"\"\n",
    "\n",
    "import json\n",
    "import logging\n",
    "from botocore.exceptions import ClientError\n",
    "\n",
    "logger = logging.getLogger(__name__)\n",
    "\n",
    "\n",
    "# snippet-start:[python.example_code.textract.TextractWrapper]\n",
    "class TextractWrapper:\n",
    "    \"\"\"Encapsulates Textract functions.\"\"\"\n",
    "\n",
    "    def __init__(self, textract_client, s3_resource, sqs_resource):\n",
    "        \"\"\"\n",
    "        :param textract_client: A Boto3 Textract client.\n",
    "        :param s3_resource: A Boto3 Amazon S3 resource.\n",
    "        :param sqs_resource: A Boto3 Amazon SQS resource.\n",
    "        \"\"\"\n",
    "        self.textract_client = textract_client\n",
    "        self.s3_resource = s3_resource\n",
    "        self.sqs_resource = sqs_resource\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.TextractWrapper]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.DetectDocumentText]\n",
    "    def detect_file_text(self, *, document_file_name=None, document_bytes=None):\n",
    "        \"\"\"\n",
    "        Detects text elements in a local image file or from in-memory byte data.\n",
    "        The image must be in PNG or JPG format.\n",
    "\n",
    "        :param document_file_name: The name of a document image file.\n",
    "        :param document_bytes: In-memory byte data of a document image.\n",
    "        :return: The response from Amazon Textract, including a list of blocks\n",
    "                 that describe elements detected in the image.\n",
    "        \"\"\"\n",
    "        if document_file_name is not None:\n",
    "            with open(document_file_name, \"rb\") as document_file:\n",
    "                document_bytes = document_file.read()\n",
    "        try:\n",
    "            response = self.textract_client.detect_document_text(\n",
    "                Document={\"Bytes\": document_bytes}\n",
    "            )\n",
    "            logger.info(\"Detected %s blocks.\", len(response[\"Blocks\"]))\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't detect text.\")\n",
    "            raise\n",
    "        else:\n",
    "            return response\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.DetectDocumentText]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.AnalyzeDocument]\n",
    "    def analyze_file(\n",
    "        self, feature_types, *, document_file_name=None, document_bytes=None\n",
    "    ):\n",
    "        \"\"\"\n",
    "        Detects text and additional elements, such as forms or tables, in a local image\n",
    "        file or from in-memory byte data.\n",
    "        The image must be in PNG or JPG format.\n",
    "\n",
    "        :param feature_types: The types of additional document features to detect.\n",
    "        :param document_file_name: The name of a document image file.\n",
    "        :param document_bytes: In-memory byte data of a document image.\n",
    "        :return: The response from Amazon Textract, including a list of blocks\n",
    "                 that describe elements detected in the image.\n",
    "        \"\"\"\n",
    "        if document_file_name is not None:\n",
    "            with open(document_file_name, \"rb\") as document_file:\n",
    "                document_bytes = document_file.read()\n",
    "        try:\n",
    "            response = self.textract_client.analyze_document(\n",
    "                Document={\"Bytes\": document_bytes}, FeatureTypes=feature_types\n",
    "            )\n",
    "            logger.info(\"Detected %s blocks.\", len(response[\"Blocks\"]))\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't detect text.\")\n",
    "            raise\n",
    "        else:\n",
    "            return response\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.AnalyzeDocument]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.helper.prepare_job]\n",
    "    def prepare_job(self, bucket_name, document_name, document_bytes):\n",
    "        \"\"\"\n",
    "        Prepares a document image for an asynchronous detection job by uploading\n",
    "        the image bytes to an Amazon S3 bucket. Amazon Textract must have permission\n",
    "        to read from the bucket to process the image.\n",
    "\n",
    "        :param bucket_name: The name of the Amazon S3 bucket.\n",
    "        :param document_name: The name of the image stored in Amazon S3.\n",
    "        :param document_bytes: The image as byte data.\n",
    "        \"\"\"\n",
    "        try:\n",
    "            bucket = self.s3_resource.Bucket(bucket_name)\n",
    "            bucket.upload_fileobj(document_bytes, document_name)\n",
    "            logger.info(\"Uploaded %s to %s.\", document_name, bucket_name)\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't upload %s to %s.\", document_name, bucket_name)\n",
    "            raise\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.helper.prepare_job]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.helper.check_job_queue]\n",
    "    def check_job_queue(self, queue_url, job_id):\n",
    "        \"\"\"\n",
    "        Polls an Amazon SQS queue for messages that indicate a specified Textract\n",
    "        job has completed.\n",
    "\n",
    "        :param queue_url: The URL of the Amazon SQS queue to poll.\n",
    "        :param job_id: The ID of the Textract job.\n",
    "        :return: The status of the job.\n",
    "        \"\"\"\n",
    "        status = None\n",
    "        try:\n",
    "            queue = self.sqs_resource.Queue(queue_url)\n",
    "            messages = queue.receive_messages()\n",
    "            if messages:\n",
    "                msg_body = json.loads(messages[0].body)\n",
    "                msg = json.loads(msg_body[\"Message\"])\n",
    "                if msg.get(\"JobId\") == job_id:\n",
    "                    messages[0].delete()\n",
    "                    status = msg.get(\"Status\")\n",
    "                    logger.info(\n",
    "                        \"Got message %s with status %s.\", messages[0].message_id, status\n",
    "                    )\n",
    "            else:\n",
    "                logger.info(\"No messages in queue %s.\", queue_url)\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't get messages from queue %s.\", queue_url)\n",
    "        else:\n",
    "            return status\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.helper.check_job_queue]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.StartDocumentTextDetection]\n",
    "    def start_detection_job(\n",
    "        self, bucket_name, document_file_name, sns_topic_arn, sns_role_arn\n",
    "    ):\n",
    "        \"\"\"\n",
    "        Starts an asynchronous job to detect text elements in an image stored in an\n",
    "        Amazon S3 bucket. Textract publishes a notification to the specified Amazon SNS\n",
    "        topic when the job completes.\n",
    "        The image must be in PNG, JPG, or PDF format.\n",
    "\n",
    "        :param bucket_name: The name of the Amazon S3 bucket that contains the image.\n",
    "        :param document_file_name: The name of the document image stored in Amazon S3.\n",
    "        :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic\n",
    "                              where the job completion notification is published.\n",
    "        :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)\n",
    "                             role that can be assumed by Textract and grants permission\n",
    "                             to publish to the Amazon SNS topic.\n",
    "        :return: The ID of the job.\n",
    "        \"\"\"\n",
    "        try:\n",
    "            response = self.textract_client.start_document_text_detection(\n",
    "                DocumentLocation={\n",
    "                    \"S3Object\": {\"Bucket\": bucket_name, \"Name\": document_file_name}\n",
    "                },\n",
    "                NotificationChannel={\n",
    "                    \"SNSTopicArn\": sns_topic_arn,\n",
    "                    \"RoleArn\": sns_role_arn,\n",
    "                },\n",
    "            )\n",
    "            job_id = response[\"JobId\"]\n",
    "            logger.info(\n",
    "                \"Started text detection job %s on %s.\", job_id, document_file_name\n",
    "            )\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't detect text in %s.\", document_file_name)\n",
    "            raise\n",
    "        else:\n",
    "            return job_id\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.StartDocumentTextDetection]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.GetDocumentTextDetection]\n",
    "    def get_detection_job(self, job_id):\n",
    "        \"\"\"\n",
    "        Gets data for a previously started text detection job.\n",
    "\n",
    "        :param job_id: The ID of the job to retrieve.\n",
    "        :return: The job data, including a list of blocks that describe elements\n",
    "                 detected in the image.\n",
    "        \"\"\"\n",
    "        try:\n",
    "            response = self.textract_client.get_document_text_detection(JobId=job_id)\n",
    "            job_status = response[\"JobStatus\"]\n",
    "            logger.info(\"Job %s status is %s.\", job_id, job_status)\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't get data for job %s.\", job_id)\n",
    "            raise\n",
    "        else:\n",
    "            return response\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.GetDocumentTextDetection]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.StartDocumentAnalysis]\n",
    "    def start_analysis_job(\n",
    "        self,\n",
    "        bucket_name,\n",
    "        document_file_name,\n",
    "        feature_types,\n",
    "        sns_topic_arn,\n",
    "        sns_role_arn,\n",
    "    ):\n",
    "        \"\"\"\n",
    "        Starts an asynchronous job to detect text and additional elements, such as\n",
    "        forms or tables, in an image stored in an Amazon S3 bucket. Textract publishes\n",
    "        a notification to the specified Amazon SNS topic when the job completes.\n",
    "        The image must be in PNG, JPG, or PDF format.\n",
    "\n",
    "        :param bucket_name: The name of the Amazon S3 bucket that contains the image.\n",
    "        :param document_file_name: The name of the document image stored in Amazon S3.\n",
    "        :param feature_types: The types of additional document features to detect.\n",
    "        :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic\n",
    "                              where job completion notification is published.\n",
    "        :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)\n",
    "                             role that can be assumed by Textract and grants permission\n",
    "                             to publish to the Amazon SNS topic.\n",
    "        :return: The ID of the job.\n",
    "        \"\"\"\n",
    "        try:\n",
    "            response = self.textract_client.start_document_analysis(\n",
    "                DocumentLocation={\n",
    "                    \"S3Object\": {\"Bucket\": bucket_name, \"Name\": document_file_name}\n",
    "                },\n",
    "                NotificationChannel={\n",
    "                    \"SNSTopicArn\": sns_topic_arn,\n",
    "                    \"RoleArn\": sns_role_arn,\n",
    "                },\n",
    "                FeatureTypes=feature_types,\n",
    "            )\n",
    "            job_id = response[\"JobId\"]\n",
    "            logger.info(\n",
    "                \"Started text analysis job %s on %s.\", job_id, document_file_name\n",
    "            )\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't analyze text in %s.\", document_file_name)\n",
    "            raise\n",
    "        else:\n",
    "            return job_id\n",
    "\n",
    "    # snippet-end:[python.example_code.textract.StartDocumentAnalysis]\n",
    "\n",
    "    # snippet-start:[python.example_code.textract.GetDocumentAnalysis]\n",
    "    def get_analysis_job(self, job_id):\n",
    "        \"\"\"\n",
    "        Gets data for a previously started detection job that includes additional\n",
    "        elements.\n",
    "\n",
    "        :param job_id: The ID of the job to retrieve.\n",
    "        :return: The job data, including a list of blocks that describe elements\n",
    "                 detected in the image.\n",
    "        \"\"\"\n",
    "        try:\n",
    "            response = self.textract_client.get_document_analysis(JobId=job_id)\n",
    "            job_status = response[\"JobStatus\"]\n",
    "            logger.info(\"Job %s status is %s.\", job_id, job_status)\n",
    "        except ClientError:\n",
    "            logger.exception(\"Couldn't get data for job %s.\", job_id)\n",
    "            raise\n",
    "        else:\n",
    "            return response\n",
    "\n",
    "\n",
    "# snippet-end:[python.example_code.textract.GetDocumentAnalysis]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kpF9UagY01-l"
   },
   "source": [
    "Next, we create Textract and S3 clients and pass them to an instance of `TextractWrapper`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "aetKe-lg0-58"
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "\n",
    "textract_client = boto3.client('textract')\n",
    "s3_client = boto3.client('s3')\n",
    "\n",
    "textractWrapper = TextractWrapper(textract_client, s3_client, None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "U3Qr-MTL4dkT"
   },
   "source": [
    "We are now ready to make calls to Textract. At a high level, Textract has two modes: synchronous and asynchronous. Synchronous calls return the parsed output once it is completed. As of the time of writing (March 2024), however, multipage PDF processing is only supported [asynchronously](https://docs.aws.amazon.com/textract/latest/dg/sync.html). So for our purposes here, we will only explore the asynchronous route.\n",
    "\n",
    "Asynchronous calls follow the process below:\n",
    "\n",
    "1. Send a request to Textract with an SNS topic, S3 bucket, and the name (key) of the document inside that bucket to process. Textract returns a job ID that can be used to track the status of the request.\n",
    "2. Textract fetches the document from S3 and processes it\n",
    "3. Once the request is complete, Textract sends out a message to the SNS topic. This can be used in conjunction with other services such as Lambda or SQS for downstream processes.\n",
    "4. The parsed results can be fetched from Textract in chunks via the job ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pb9AIE4v4Yww"
   },
   "outputs": [],
   "source": [
    "bucket_name = \"your-bucket-name\"\n",
    "sns_topic_arn = \"your-sns-arn\" # this can be found under the topic you created in the Amazon SNS dashboard\n",
    "sns_role_arn = \"sns-role-arn\" # this is an IAM role that allows Textract to interact with SNS\n",
    "\n",
    "file_name = \"example-drug-label.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tPG-P7c79xln"
   },
   "outputs": [],
   "source": [
    "# kick off a text detection job. This returns a job ID.\n",
    "job_id = textractWrapper.start_detection_job(bucket_name=bucket_name, document_file_name=file_name,\n",
    "                                    sns_topic_arn=sns_topic_arn, sns_role_arn=sns_role_arn)"
   ]
  },
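  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The SNS notification is the intended completion signal for asynchronous jobs, but in a notebook it can be simpler to poll the job status until it is no longer `IN_PROGRESS`. The following is a minimal sketch of such a loop and is not part of the AWS sample code: `wait_for_job` and its `get_status` argument are hypothetical names, and `get_status` is assumed to be any callable that maps a job ID to a Textract `JobStatus` string.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def wait_for_job(get_status, job_id, poll_seconds=5.0, timeout_seconds=600.0):\n",
    "    '''Poll get_status(job_id) until the job leaves IN_PROGRESS or the timeout expires.'''\n",
    "    deadline = time.monotonic() + timeout_seconds\n",
    "    while True:\n",
    "        status = get_status(job_id)\n",
    "        if status != 'IN_PROGRESS':\n",
    "            # Textract reports SUCCEEDED, FAILED or PARTIAL_SUCCESS when a job ends\n",
    "            return status\n",
    "        if time.monotonic() >= deadline:\n",
    "            raise TimeoutError('Job {} is still in progress'.format(job_id))\n",
    "        time.sleep(poll_seconds)\n",
    "```\n",
    "\n",
    "With the client created earlier, `get_status` could be `lambda j: textract_client.get_document_text_detection(JobId=j)['JobStatus']`. For production workloads, reacting to the SNS message is usually preferable to polling."
   ]
  },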
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "M0e0hZZJ-yCf"
   },
   "source": [
    "Once the job completes, fetching its results returns a dictionary with the following keys:\n",
    "\n",
    "```dict_keys(['DocumentMetadata', 'JobStatus', 'NextToken', 'Blocks', 'AnalyzeDocumentModelVersion', 'ResponseMetadata'])```\n",
    "\n",
    "This response corresponds to one chunk of information parsed by Textract. The number of chunks a document is parsed into depends on the length of the document. The two keys we are most interested in are `Blocks` and `NextToken`. `Blocks` contains all of the information that was extracted from this chunk, while `NextToken` tells us what chunk comes next, if any.\n",
    "\n",
    "Textract returns an information-rich representation of the extracted text, including each element's position on the page and its hierarchical relationships with other elements, all the way down to the individual word level. Since we are only interested in the raw text, we need a way to parse through all of the chunks and their `Blocks`. Fortunately, Amazon provides some [helper functions](https://github.com/aws-samples/textract-paragraph-identification/tree/main) for this purpose, which we use below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9h-6yBBJCYC-"
   },
   "outputs": [],
   "source": [
    "def get_text_results_from_textract(job_id):\n",
    "    # fetch the first chunk of results, then follow NextToken until no chunks remain\n",
    "    response = textract_client.get_document_text_detection(JobId=job_id)\n",
    "    collection_of_textract_responses = [response]\n",
    "\n",
    "    while 'NextToken' in response:\n",
    "        next_token = response['NextToken']\n",
    "        response = textract_client.get_document_text_detection(JobId=job_id, NextToken=next_token)\n",
    "        collection_of_textract_responses.append(response)\n",
    "    return collection_of_textract_responses\n",
    "\n",
    "def get_the_text_with_required_info(collection_of_textract_responses):\n",
    "    total_text = []\n",
    "    total_text_with_info = []\n",
    "    running_sequence_number = 0\n",
    "\n",
    "    font_sizes_and_line_numbers = {}\n",
    "    for page in collection_of_textract_responses:\n",
    "        per_page_text = []\n",
    "        blocks = page['Blocks']\n",
    "        for block in blocks:\n",
    "            if block['BlockType'] == 'LINE':\n",
    "                block_text_dict = {}\n",
    "                running_sequence_number += 1\n",
    "                block_text_dict.update(text=block['Text'])\n",
    "                block_text_dict.update(page=block['Page'])\n",
    "                block_text_dict.update(left_indent=round(block['Geometry']['BoundingBox']['Left'], 2))\n",
    "                font_height = round(block['Geometry']['BoundingBox']['Height'], 3)\n",
    "                line_number = running_sequence_number\n",
    "                block_text_dict.update(font_height=round(block['Geometry']['BoundingBox']['Height'], 3))\n",
    "                block_text_dict.update(indent_from_top=round(block['Geometry']['BoundingBox']['Top'], 2))\n",
    "                block_text_dict.update(text_width=round(block['Geometry']['BoundingBox']['Width'], 2))\n",
    "                block_text_dict.update(line_number=running_sequence_number)\n",
    "\n",
    "                if font_height in font_sizes_and_line_numbers:\n",
    "                    line_numbers = font_sizes_and_line_numbers[font_height]\n",
    "                    line_numbers.append(line_number)\n",
    "                    font_sizes_and_line_numbers[font_height] = line_numbers\n",
    "                else:\n",
    "                    line_numbers = []\n",
    "                    line_numbers.append(line_number)\n",
    "                    font_sizes_and_line_numbers[font_height] = line_numbers\n",
    "\n",
    "                total_text.append(block['Text'])\n",
    "                per_page_text.append(block['Text'])\n",
    "                total_text_with_info.append(block_text_dict)\n",
    "\n",
    "    return total_text, total_text_with_info, font_sizes_and_line_numbers\n",
    "\n",
    "def get_text_with_line_spacing_info(total_text_with_info):\n",
    "    i = 1\n",
    "    text_info_with_line_spacing_info = []\n",
    "    while (i < len(total_text_with_info) - 1):\n",
    "        previous_line_info = total_text_with_info[i - 1]\n",
    "        current_line_info = total_text_with_info[i]\n",
    "        next_line_info = total_text_with_info[i + 1]\n",
    "        if current_line_info['page'] == next_line_info['page'] and previous_line_info['page'] == current_line_info[\n",
    "            'page']:\n",
    "            line_spacing_after = round((next_line_info['indent_from_top'] - current_line_info['indent_from_top']), 2)\n",
    "            spacing_with_prev = round((current_line_info['indent_from_top'] - previous_line_info['indent_from_top']), 2)\n",
    "            current_line_info.update(line_space_before=spacing_with_prev)\n",
    "            current_line_info.update(line_space_after=line_spacing_after)\n",
    "            text_info_with_line_spacing_info.append(current_line_info)\n",
    "        else:\n",
    "            text_info_with_line_spacing_info.append(None)\n",
    "        i += 1\n",
    "    return text_info_with_line_spacing_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "McBARKH_ClEU"
   },
   "source": [
    "We pass the job ID from before to `get_text_results_from_textract` to fetch all of the chunks associated with this job. Then, we pass the resulting list through `get_the_text_with_required_info` and `get_text_with_line_spacing_info` to organize the text into lines.\n",
    "\n",
    "Finally, we can concatenate the lines into one string to pass into our downstream RAG pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0CieHFURIhBm"
   },
   "outputs": [],
   "source": [
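    "textract_responses = get_text_results_from_textract(job_id)\n",
    "total_text, total_text_with_info, _ = get_the_text_with_required_info(textract_responses)\n",
    "text_info_with_line_spacing = get_text_with_line_spacing_info(total_text_with_info)\n",
    "all_text = \"\\n\".join([line[\"text\"] if line else \"\" for line in text_info_with_line_spacing])\n",
    "\n",
    "with open(f\"{data_dir}/aws-parsed-{source_filename}.txt\", \"w\") as f:\n",
    "  f.write(all_text)"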
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fy8iAFpbggcD"
   },
   "source": [
    "#### Visualize the parsed document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3AwZQnFRghv8"
   },
   "outputs": [],
   "source": [
    "filename = \"aws-parsed-{}.txt\".format(source_filename)\n",
    "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n",
    "    parsed_document = doc.read()\n",
    "\n",
    "print(parsed_document[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LEO2tP6WJmtn"
   },
   "source": [
    "<a name=\"unstructured\"></a>\n",
    "### Solution 3: Unstructured.io [[Back to Solutions]](#top)\n",
    "\n",
    "Unstructured.io provides libraries with open-source components for pre-processing text documents such as PDFs, HTML and Word Documents.\n",
    "<br>\n",
    "External documentation: https://github.com/Unstructured-IO/unstructured-api"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "julG3fBrgmqQ"
   },
   "source": [
    "#### Parsing the document\n",
    "\n",
    "This guide assumes that an endpoint hosting the service already exists. The API is offered in two forms:\n",
    "1. [a hosted version](https://unstructured.io/)\n",
    "2. [an OSS docker image](https://github.com/Unstructured-IO/unstructured-api?tab=readme-ov-file#dizzy-instructions-for-using-the-docker-image)\n",
    "<br>\n",
    "**Note: You can skip to the next block if you want to use the pre-existing parsed version.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5-jaa4-aJiJG"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import requests\n",
    "\n",
    "UNSTRUCTURED_URL = \"\" # enter service endpoint\n",
    "\n",
    "parsed_documents = []\n",
    "\n",
    "input_path = \"{}/{}.{}\".format(data_dir, source_filename, extension)\n",
    "with open(input_path, 'rb') as file_data:\n",
    "    response = requests.post(\n",
    "        url=UNSTRUCTURED_URL,\n",
    "        files={\"files\": (\"{}.{}\".format(source_filename, extension), file_data)},\n",
    "        data={\n",
    "            \"output_format\": (None, \"application/json\"),\n",
    "            \"strategy\": \"hi_res\",\n",
    "            \"pdf_infer_table_structure\": \"true\",\n",
    "            \"include_page_breaks\": \"true\"\n",
    "        },\n",
    "        headers={\"Accept\": \"application/json\"}\n",
    "    )\n",
    "\n",
    "parsed_response = response.json()\n",
    "\n",
    "parsed_document = \" \".join([parsed_entry[\"text\"] for parsed_entry in parsed_response])\n",
    "print(\"Parsed {}\".format(source_filename))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5qI5Q85vTfxs"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Post process parsed document and store it locally.\n",
    "\"\"\"\n",
    "\n",
    "file_path = \"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, \"unstructured-io\")\n",
    "store_document(file_path, parsed_document)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kOAYMvt1HlAj"
   },
   "source": [
    "#### Visualize the parsed document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "cBr_roPTHonz"
   },
   "outputs": [],
   "source": [
    "filename = \"unstructured-io-parsed-{}.txt\".format(source_filename)\n",
    "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n",
    "    parsed_document = doc.read()\n",
    "\n",
    "print(parsed_document[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KF528hYyEQZx"
   },
   "source": [
    "<a name=\"llama\"></a>\n",
    "\n",
    "### Solution 4: LlamaParse [[Back to Solutions]](#top)\n",
    "\n",
    "LlamaParse is an API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation using LlamaIndex frameworks.\n",
    "<br>\n",
    "External documentation: https://github.com/run-llama/llama_parse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oBLNlEVXgshj"
   },
   "source": [
    "#### Parsing the document\n",
    "\n",
    "The following block uses the LlamaParse cloud offering. You can learn more and obtain an API key for the service [here](https://cloud.llamaindex.ai/parse).\n",
    "<br>\n",
    "LlamaParse offers two output modes, both of which we explore and compare below:\n",
    "- Text\n",
    "- Markdown\n",
    "<br>\n",
    "\n",
    "**Note: You can skip to the next block if you want to use the pre-existing parsed version.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hR_85riD9ApM"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from llama_parse import LlamaParse\n",
    "\n",
    "import nest_asyncio # needed in a notebook environment\n",
    "nest_asyncio.apply() # needed in a notebook environment\n",
    "\n",
    "llama_index_api_key = \"{API_KEY}\"\n",
    "input_path = \"{}/{}.{}\".format(data_dir, source_filename, extension)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "e4TskJQ4EdpA"
   },
   "outputs": [],
   "source": [
    "# Text mode\n",
    "text_parser = LlamaParse(\n",
    "    api_key=llama_index_api_key,\n",
    "    result_type=\"text\"\n",
    ")\n",
    "\n",
    "text_response = text_parser.load_data(input_path)\n",
    "text_parsed_document = \" \".join([parsed_entry.text for parsed_entry in text_response])\n",
    "\n",
    "print(\"Parsed {} to text\".format(source_filename))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "GKca2BfI9X5O"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Post process parsed document and store it locally.\n",
    "\"\"\"\n",
    "\n",
    "file_path = \"{}/{}-text-parsed-fda-approved-drug.txt\".format(data_dir, \"llamaparse\")\n",
    "store_document(file_path, text_parsed_document)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Mjig7g-f89PQ"
   },
   "outputs": [],
   "source": [
    "# Markdown mode\n",
    "markdown_parser = LlamaParse(\n",
    "    api_key=llama_index_api_key,\n",
    "    result_type=\"markdown\"\n",
    ")\n",
    "\n",
    "markdown_response = markdown_parser.load_data(input_path)\n",
    "markdown_parsed_document = \" \".join([parsed_entry.text for parsed_entry in markdown_response])\n",
    "\n",
    "print(\"Parsed {} to markdown\".format(source_filename))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "upmHMI8SLcXZ"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Post process parsed document and store it locally.\n",
    "\"\"\"\n",
    "\n",
    "file_path = \"{}/{}-markdown-parsed-fda-approved-drug.txt\".format(data_dir, \"llamaparse\")\n",
    "store_document(file_path, markdown_parsed_document)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "P88GmKarHsTQ"
   },
   "source": [
    "#### Visualize the parsed document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Y0Cs-a03Huob"
   },
   "outputs": [],
   "source": [
    "# Text parsing\n",
    "\n",
    "filename = \"llamaparse-text-parsed-{}.txt\".format(source_filename)\n",
    "\n",
    "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n",
    "    parsed_document = doc.read()\n",
    "    \n",
    "print(parsed_document[:1000])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XW6SIVEg_fRN"
   },
   "outputs": [],
   "source": [
    "# Markdown parsing\n",
    "\n",
    "filename = \"llamaparse-markdown-parsed-fda-approved-drug.txt\"\n",
    "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n",
    "    parsed_document = doc.read()\n",
    "    \n",
    "print(parsed_document[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8QQ_RopYZ_mf"
   },
   "source": [
    "<a name=\"pdf2image\"></a>\n",
    "\n",
    "### Solution 5: pdf2image + pytesseract [[Back to Solutions]](#top)\n",
    "\n",
    "The final parsing method we examine does not rely on cloud services, but rather on two libraries: `pdf2image` and `pytesseract`. `pytesseract` lets you perform OCR locally on images, but not on PDF files. So, we first convert our PDF into a set of images via `pdf2image`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "unM8RfqYgzWC"
   },
   "source": [
    "#### Parsing the document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4kt-bmbGeWsf"
   },
   "outputs": [],
   "source": [
    "from matplotlib import pyplot as plt\n",
    "from pdf2image import convert_from_path\n",
    "import pytesseract"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5bE2Sebtf_eU"
   },
   "outputs": [],
   "source": [
    "# convert_from_path returns a list of PIL.Image objects, one per page of the source PDF\n",
    "pages = convert_from_path(input_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ViLTaLPYeT-o"
   },
   "outputs": [],
   "source": [
    "# we look at the first page as a sanity check:\n",
    "\n",
    "plt.imshow(pages[0])\n",
    "plt.axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mZhXeNTbBKU_"
   },
   "source": [
    "Now, we can process the image of each page with `pytesseract` and concatenate the results to get our parsed document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "W1SNCEd91k7N"
   },
   "outputs": [],
   "source": [
    "label_ocr_pytesseract = \"\".join([pytesseract.image_to_string(page) for page in pages])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "tv9hLiJgBANJ",
    "outputId": "1c3bd5dc-cfe9-43e6-9cc7-bc695a210d31"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HIGHLIGHTS OF PRESCRIBING INFORMATION\n",
      "\n",
      "These highlights do not include all the information needed to use\n",
      "IWILFIN™ safely and effectively. See full prescribing information for\n",
      "IWILFIN.\n",
      "\n",
      "IWILFIN™ (eflor\n"
     ]
    }
   ],
   "source": [
    "print(label_ocr_pytesseract[:200])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "TIwcDmjXejKV"
   },
   "outputs": [],
   "source": [
    "# store the OCR output computed above next to the other parsed documents\n",
    "with open(f\"{data_dir}/pytesseract-parsed-{source_filename}.txt\", \"w\") as f:\n",
    "  f.write(label_ocr_pytesseract)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "j0u1UfNdaFt4"
   },
   "source": [
    "#### Visualize the parsed document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "HaJutaBkf9Yj"
   },
   "outputs": [],
   "source": [
    "filename = \"pytesseract-parsed-{}.txt\".format(source_filename)\n",
    "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n",
    "    parsed_document = doc.read()\n",
    "\n",
    "print(parsed_document[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SCbkT4oZSfs9",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "<a name=\"document-questions\"></a>\n",
    "## Document Questions\n",
    "\n",
    "We can now ask a set of simple and complex questions and see how each parsing solution performs with Command-R. The questions are:\n",
    "- **What are the most common adverse reactions of Iwilfin?**\n",
    "  - Task: Simple information extraction\n",
    "- **What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?**\n",
    "  - Task: Tabular data extraction\n",
    "- **I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.**\n",
    "  - Task: Overall document summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WSo0xA4vZAHV"
   },
   "outputs": [],
   "source": [
    "import cohere\n",
    "co = cohere.Client(api_key=\"{API_KEY}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gyPYqKErY7ni"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Document Questions\n",
    "\"\"\"\n",
    "prompt = \"What are the most common adverse reactions of Iwilfin?\"\n",
    "# prompt = \"What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\"\n",
    "# prompt = \"I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.\"\n",
    "\n",
    "\"\"\"\n",
    "Choose one of the above solutions\n",
    "\"\"\"\n",
    "source = \"gcp\"\n",
    "# source = \"aws\"\n",
    "# source = \"unstructured-io\"\n",
    "# source = \"llamaparse-text\"\n",
    "# source = \"llamaparse-markdown\"\n",
    "# source = \"pytesseract\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BwdK7trMKykt",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Data Ingestion\n",
    "<a name=\"ingestion\"></a>\n",
    "\n",
    "In order to set up our RAG implementation, we need to separate the parsed text into chunks and load the chunks to an index. The index will allow us to retrieve relevant passages from the document for different queries. Here, we use a simple implementation of indexing using the `hnswlib` library. Note that there are many different indexing solutions that are appropriate for specific production use cases."
   ]
  },
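  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate what chunking with overlap means, here is a simplified sketch that splits text into fixed-size character windows sharing a number of characters with their neighbors. It is an illustration only: `chunk_with_overlap` is a hypothetical helper, and a recursive splitter that prefers to break on separators such as paragraphs and spaces will generally produce different chunks.\n",
    "\n",
    "```python\n",
    "def chunk_with_overlap(text, chunk_size=512, chunk_overlap=200):\n",
    "    '''Split text into character windows of chunk_size that share chunk_overlap characters.'''\n",
    "    if not 0 <= chunk_overlap < chunk_size:\n",
    "        raise ValueError('chunk_overlap must be non-negative and smaller than chunk_size')\n",
    "    step = chunk_size - chunk_overlap\n",
    "    chunks = []\n",
    "    for start in range(0, len(text), step):\n",
    "        chunks.append(text[start:start + chunk_size])\n",
    "        if start + chunk_size >= len(text):\n",
    "            break  # this window already reaches the end of the text\n",
    "    return chunks\n",
    "```\n",
    "\n",
    "A larger overlap makes it less likely that a sentence or table row is cut off from its context at a chunk boundary, at the cost of more chunks to embed and index."
   ]
  },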
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "46rPn1uoLQDa"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Read parsed document content and chunk data\n",
    "\"\"\"\n",
    "\n",
    "import os\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "documents = []\n",
    "\n",
    "with open(\"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, source), \"r\") as doc:\n",
    "    doc_content = doc.read()\n",
    "\n",
    "\"\"\"\n",
    "Personal notes on chunking\n",
    "https://medium.com/@ayhamboucher/llm-based-context-splitter-for-large-documents-445d3f02b01b\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "# Chunk doc content\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=512,\n",
    "    chunk_overlap=200,\n",
    "    length_function=len,\n",
    "    is_separator_regex=False\n",
    ")\n",
    "\n",
    "# Split the text into chunks with some overlap\n",
    "chunks_ = text_splitter.create_documents([doc_content])\n",
    "documents = [c.page_content for c in chunks_]\n",
    "\n",
    "print(\"Source document has been broken down to {} chunks\".format(len(documents)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "lk4YgREV7LgC"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Embed document chunks\n",
    "\"\"\"\n",
    "document_embeddings = co.embed(texts=documents, model=\"embed-english-v3.0\", input_type=\"search_document\").embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xtG3eblo7Mkd",
    "outputId": "6dfb3a1f-d4a5-480b-b8e3-233950fce701"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Count: 115\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "Create document index and add embedded chunks\n",
    "\"\"\"\n",
    "\n",
    "import hnswlib\n",
    "\n",
    "index = hnswlib.Index(space='ip', dim=1024) # space: inner product\n",
    "index.init_index(max_elements=len(document_embeddings), ef_construction=512, M=64)\n",
    "index.add_items(document_embeddings, list(range(len(document_embeddings))))\n",
    "print(\"Count:\", index.element_count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YIncJz3qhWkg",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Retrieval\n",
    "\n",
    "In this step, we use k-nearest neighbors to fetch the most relevant documents for our query. Once the nearest neighbors are retrieved, we use Cohere's reranker to reorder them by relevance to our input search query."
   ]
  },
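  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the first retrieval stage concrete, the sketch below computes an exact (brute-force) nearest-neighbor ranking by inner product in plain Python. It is only an illustration of the idea: `top_k_by_inner_product` is a hypothetical helper, and an HNSW index approximates this ranking without scoring every document.\n",
    "\n",
    "```python\n",
    "def top_k_by_inner_product(query_vector, document_vectors, k):\n",
    "    '''Return the indices of the k document vectors with the largest inner product.'''\n",
    "    scores = [\n",
    "        sum(q * d for q, d in zip(query_vector, vector))\n",
    "        for vector in document_vectors\n",
    "    ]\n",
    "    # rank document indices from the highest score to the lowest\n",
    "    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)\n",
    "    return ranked[:k]\n",
    "```\n",
    "\n",
    "The brute-force version scores every chunk, which is fine for a single document but scales linearly with the size of the collection. That cost is the reason approximate indexes are used in practice."
   ]
  },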
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5rTZcKQ48tAJ"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Embed search query\n",
    "Fetch k nearest neighbors\n",
    "\"\"\"\n",
    "\n",
    "query_emb = co.embed(texts=[prompt], model='embed-english-v3.0', input_type=\"search_query\").embeddings\n",
    "default_knn = 10\n",
    "knn = default_knn if default_knn <= index.element_count else index.element_count\n",
    "result = index.knn_query(query_emb, k=knn)\n",
    "neighbors = [(result[0][0][i], result[1][0][i]) for i in range(len(result[0][0]))]\n",
    "relevant_docs = [documents[x[0]] for x in sorted(neighbors, key=lambda x: x[1])]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vz8jbX8A9RO_"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Rerank retrieved documents\n",
    "\"\"\"\n",
    "\n",
    "rerank_results = co.rerank(query=prompt, documents=relevant_docs, top_n=3, model='rerank-english-v2.0').results\n",
    "reranked_relevant_docs = format_docs_for_chat([x.document[\"text\"] for x in rerank_results])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KX0RYtx1HW_h",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Final Step: Call Command-R + RAG!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "beUFAEKnHYCQ"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Call the /chat endpoint with command-r\n",
    "\"\"\"\n",
    "\n",
    "response = co.chat(\n",
    "    message=prompt,\n",
    "    model=\"command-r\",\n",
    "    documents=reranked_relevant_docs\n",
    ")\n",
    "\n",
    "cited_response, citations_reference = insert_citations_in_order(response.text, response.citations, reranked_relevant_docs)\n",
    "print(cited_response)\n",
    "print(\"\\n\")\n",
    "print(\"References:\")\n",
    "print(citations_reference)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lzJclPa6hnpi",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Head-to-head Comparisons\n",
    "\n",
    "Run the code cells below to make head-to-head comparisons of the different parsing techniques across different questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "52wdFoILLy85"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "results = pd.read_csv(\"{}/results-table.csv\".format(data_dir))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "l5WswlWbO4Da",
    "outputId": "ddf65c9d-ef12-4b07-a000-b8c1e6c1319e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Question 1: What are the most common adverse reactions of Iwilfin?\n",
      "Question 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\n",
      "Question 3: I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.\n",
      "\n",
      "Pick which question you want to see (1,2,3):  3\n",
      "Do you want to see the references as well? References are long and noisy (y/n): n\n",
      "\n",
      "\n",
      "\n",
      "| gcp |\n",
      "\n",
      "\n",
      "Compound Name: eflornithine hydrochloride ([0], [1], [2]) (IWILFIN ([1])™)\n",
      "\n",
      "Indication: used to reduce the risk of relapse in adult and paediatric patients with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded at least partially to prior multiagent, multimodality therapy. ([1], [3], [4])\n",
      "\n",
      "Route of Administration: IWILFIN™ tablets ([1], [3], [4]) are taken orally twice daily ([3], [4]), with doses ranging from 192 to 768 mg based on body surface area. ([3], [4])\n",
      "\n",
      "Mechanism of Action: IWILFIN™ is an ornithine decarboxylase inhibitor. ([0], [2])\n",
      "\n",
      "\n",
      "\n",
      "| aws |\n",
      "\n",
      "\n",
      "Compound Name: eflornithine ([0], [1], [2], [3]) (IWILFIN ([0])™)\n",
      "\n",
      "Indication: used to reduce the risk of relapse ([0], [3]) in adults ([0], [3]) and paediatric patients ([0], [3]) with high-risk neuroblastoma (HRNB) ([0], [3]) who have responded to prior therapies. ([0], [3], [4])\n",
      "\n",
      "Route of Administration: Oral ([2], [4])\n",
      "\n",
      "Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. ([1])\n",
      "\n",
      "\n",
      "| unstructured-io |\n",
      "\n",
      "\n",
      "Compound Name: Iwilfin ([1], [2], [3], [4]) (eflornithine) ([0], [2], [3], [4])\n",
      "\n",
      "Indication: Iwilfin is indicated to reduce the risk of relapse ([1], [3]) in adult and paediatric patients ([1], [3]) with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded to prior anti-GD2 ([1]) immunotherapy ([1], [4]) and multi-modality therapy. ([1])\n",
      "\n",
      "Route of Administration: Oral ([0], [3])\n",
      "\n",
      "Mechanism of Action: Iwilfin is an ornithine decarboxylase inhibitor. ([1], [2], [3], [4])\n",
      "\n",
      "\n",
      "| llamaparse-text |\n",
      "\n",
      "\n",
      "Compound Name: IWILFIN ([2], [3]) (eflornithine) ([3])\n",
      "\n",
      "Indication: IWILFIN is used to reduce the risk of relapse ([1], [2], [3]) in adult and paediatric patients ([1], [2], [3]) with high-risk neuroblastoma (HRNB) ([1], [2], [3]), who have responded at least partially to certain prior therapies. ([2], [3])\n",
      "\n",
      "Route of Administration: IWILFIN is administered as a tablet. ([2])\n",
      "\n",
      "Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. ([0], [1], [4])\n",
      "\n",
      "\n",
      "| llamaparse-markdown |\n",
      "\n",
      "\n",
      "Compound Name: IWILFIN ([1], [2]) (eflornithine) ([1])\n",
      "\n",
      "Indication: IWILFIN is indicated to reduce the risk of relapse ([1], [2]) in adult and paediatric patients ([1], [2]) with high-risk neuroblastoma (HRNB) ([1], [2]), who have responded at least partially ([1], [2], [3]) to prior anti-GD2 immunotherapy ([1], [2]) and multiagent, multimodality therapy. ([1], [2], [3])\n",
      "\n",
      "Route of Administration: Oral ([0], [1], [3], [4])\n",
      "\n",
      "Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([1])\n",
      "\n",
      "\n",
      "| pytesseract |\n",
      "\n",
      "\n",
      "Compound Name: IWILFIN™ ([0], [2]) (eflornithine) ([0], [2])\n",
      "\n",
      "Indication: IWILFIN is indicated to reduce the risk of relapse ([0], [2]) in adult and paediatric patients ([0], [2]) with high-risk neuroblastoma (HRNB) ([0], [2]), who have responded positively to prior anti-GD2 immunotherapy and multiagent, multimodality therapy. ([0], [2], [4])\n",
      "\n",
      "Route of Administration: IWILFIN is administered orally ([0], [1], [3], [4]), in the form of a tablet. ([1])\n",
      " \n",
      "Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([0])\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "question = input(\"\"\"\n",
    "Question 1: What are the most common adverse reactions of Iwilfin?\n",
    "Question 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\n",
    "Question 3: I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.\n",
    "\n",
    "Pick which question you want to see (1,2,3):  \"\"\")\n",
    "references = input(\"Do you want to see the references as well? References are long and noisy (y/n): \")\n",
    "print(\"\\n\\n\")\n",
    "\n",
    "index = {\"1\": 0, \"2\": 3, \"3\": 6}[question]\n",
    "\n",
    "for src in [\"gcp\", \"aws\", \"unstructured-io\", \"llamaparse-text\", \"llamaparse-markdown\", \"pytesseract\"]:\n",
    "  print(\"| {} |\".format(src))\n",
    "  print(\"\\n\")\n",
    "  print(results[src][index])\n",
    "  if references == \"y\":\n",
    "    print(\"\\n\")\n",
    "    print(\"References:\")\n",
    "    print(results[src][index+1])\n",
    "  print(\"\\n\")"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "BoM6-Tq-Sm-1",
    "jLtrGjXiJE9M",
    "dlRxuddi5w0E",
    "1YCszLJnge4j",
    "LEO2tP6WJmtn",
    "julG3fBrgmqQ",
    "KF528hYyEQZx",
    "oBLNlEVXgshj",
    "8QQ_RopYZ_mf",
    "unM8RfqYgzWC",
    "BwdK7trMKykt"
   ],
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
